import pandas as pd
import numpy as np
L11: Dates, lags, sorting
= pd.read_excel('./rawdata12.xlsx')
df df
firmid | date | return | |
---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 |
1 | 2 | 10/31/2010 | 0.450 |
2 | 3 | 11/30/2010 | 23.000 |
3 | 1 | 11/30/2010 | 0.870 |
4 | 2 | 11/30/2010 | 0.200 |
5 | 3 | 12/31/2010 | 0.340 |
6 | 1 | 12/31/2010 | 0.060 |
7 | 2 | 12/31/2010 | 0.001 |
8 | 3 | 12/31/2010 | -0.120 |
Date formats: “pd.to_datetime()” and “dt.to_period()”
Pandas offers a lot of flexibility to manipulate dates and time stamps. Much of this functionality can only be used on columns that have the Pandas “datetime” data type. We can convert dates to this data type using the .to_datetime()
function.
First, not that the date
column in our dataframe is of type “object”:
df.dtypes
firmid int64
date object
return float64
dtype: object
We’ll create a copy of the df
dataframe to avoid changing the original data in df
:
= df.copy() df2
'dtdate'] = pd.to_datetime(df2['date'])
df2[ df2
firmid | date | return | dtdate | |
---|---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 | 2010-09-30 |
1 | 2 | 10/31/2010 | 0.450 | 2010-10-31 |
2 | 3 | 11/30/2010 | 23.000 | 2010-11-30 |
3 | 1 | 11/30/2010 | 0.870 | 2010-11-30 |
4 | 2 | 11/30/2010 | 0.200 | 2010-11-30 |
5 | 3 | 12/31/2010 | 0.340 | 2010-12-31 |
6 | 1 | 12/31/2010 | 0.060 | 2010-12-31 |
7 | 2 | 12/31/2010 | 0.001 | 2010-12-31 |
8 | 3 | 12/31/2010 | -0.120 | 2010-12-31 |
df2.dtypes
firmid int64
date object
return float64
dtdate datetime64[ns]
dtype: object
Now we can apply many useful date functions (they usually have the prefix dt.
) to this datetime variable. For example, we can extract information about specific components of the date:
'year'] = df2['dtdate'].dt.year
df2['month'] = df2['dtdate'].dt.month
df2['day'] = df2['dtdate'].dt.day
df2[ df2
firmid | date | return | dtdate | year | month | day | |
---|---|---|---|---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 | 2010-09-30 | 2010 | 9 | 30 |
1 | 2 | 10/31/2010 | 0.450 | 2010-10-31 | 2010 | 10 | 31 |
2 | 3 | 11/30/2010 | 23.000 | 2010-11-30 | 2010 | 11 | 30 |
3 | 1 | 11/30/2010 | 0.870 | 2010-11-30 | 2010 | 11 | 30 |
4 | 2 | 11/30/2010 | 0.200 | 2010-11-30 | 2010 | 11 | 30 |
5 | 3 | 12/31/2010 | 0.340 | 2010-12-31 | 2010 | 12 | 31 |
6 | 1 | 12/31/2010 | 0.060 | 2010-12-31 | 2010 | 12 | 31 |
7 | 2 | 12/31/2010 | 0.001 | 2010-12-31 | 2010 | 12 | 31 |
8 | 3 | 12/31/2010 | -0.120 | 2010-12-31 | 2010 | 12 | 31 |
Another common use of the .to_datetime()
function is to construct a datetime variable from date components:
'newdate'] = pd.to_datetime(df2[['year','month','day']])
df2[ df2
firmid | date | return | dtdate | year | month | day | newdate | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 | 2010-09-30 | 2010 | 9 | 30 | 2010-09-30 |
1 | 2 | 10/31/2010 | 0.450 | 2010-10-31 | 2010 | 10 | 31 | 2010-10-31 |
2 | 3 | 11/30/2010 | 23.000 | 2010-11-30 | 2010 | 11 | 30 | 2010-11-30 |
3 | 1 | 11/30/2010 | 0.870 | 2010-11-30 | 2010 | 11 | 30 | 2010-11-30 |
4 | 2 | 11/30/2010 | 0.200 | 2010-11-30 | 2010 | 11 | 30 | 2010-11-30 |
5 | 3 | 12/31/2010 | 0.340 | 2010-12-31 | 2010 | 12 | 31 | 2010-12-31 |
6 | 1 | 12/31/2010 | 0.060 | 2010-12-31 | 2010 | 12 | 31 | 2010-12-31 |
7 | 2 | 12/31/2010 | 0.001 | 2010-12-31 | 2010 | 12 | 31 | 2010-12-31 |
8 | 3 | 12/31/2010 | -0.120 | 2010-12-31 | 2010 | 12 | 31 | 2010-12-31 |
The other, very commonly used type for date data is the Pandas period
format. This is used to specify that your data has a particular frequency, and can be done by applying the .to_period()
function to a datetime variable (e.g. use ‘Y’ for yearly frequency data, ‘M’ for monthly, and ‘Q’ for quarterly):
'yearly'] = df2['dtdate'].dt.to_period('Y')
df2['monthly'] = df2['dtdate'].dt.to_period('M')
df2['quarterly'] = df2['dtdate'].dt.to_period('Q')
df2[ df2
firmid | date | return | dtdate | year | month | day | newdate | yearly | monthly | quarterly | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 | 2010-09-30 | 2010 | 9 | 30 | 2010-09-30 | 2010 | 2010-09 | 2010Q3 |
1 | 2 | 10/31/2010 | 0.450 | 2010-10-31 | 2010 | 10 | 31 | 2010-10-31 | 2010 | 2010-10 | 2010Q4 |
2 | 3 | 11/30/2010 | 23.000 | 2010-11-30 | 2010 | 11 | 30 | 2010-11-30 | 2010 | 2010-11 | 2010Q4 |
3 | 1 | 11/30/2010 | 0.870 | 2010-11-30 | 2010 | 11 | 30 | 2010-11-30 | 2010 | 2010-11 | 2010Q4 |
4 | 2 | 11/30/2010 | 0.200 | 2010-11-30 | 2010 | 11 | 30 | 2010-11-30 | 2010 | 2010-11 | 2010Q4 |
5 | 3 | 12/31/2010 | 0.340 | 2010-12-31 | 2010 | 12 | 31 | 2010-12-31 | 2010 | 2010-12 | 2010Q4 |
6 | 1 | 12/31/2010 | 0.060 | 2010-12-31 | 2010 | 12 | 31 | 2010-12-31 | 2010 | 2010-12 | 2010Q4 |
7 | 2 | 12/31/2010 | 0.001 | 2010-12-31 | 2010 | 12 | 31 | 2010-12-31 | 2010 | 2010-12 | 2010Q4 |
8 | 3 | 12/31/2010 | -0.120 | 2010-12-31 | 2010 | 12 | 31 | 2010-12-31 | 2010 | 2010-12 | 2010Q4 |
These type of period
date are useful for many operations on the data, the most important one being that Pandas understand what you mean if you want to add or subtract some number of periods from a given date. For example:
'lastm'] = df2['monthly'] - 1
df2['nextq'] = df2['quarterly'] + 1
df2[ df2
firmid | date | return | dtdate | year | month | day | newdate | yearly | monthly | quarterly | lastm | nextq | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 | 2010-09-30 | 2010 | 9 | 30 | 2010-09-30 | 2010 | 2010-09 | 2010Q3 | 2010-08 | 2010Q4 |
1 | 2 | 10/31/2010 | 0.450 | 2010-10-31 | 2010 | 10 | 31 | 2010-10-31 | 2010 | 2010-10 | 2010Q4 | 2010-09 | 2011Q1 |
2 | 3 | 11/30/2010 | 23.000 | 2010-11-30 | 2010 | 11 | 30 | 2010-11-30 | 2010 | 2010-11 | 2010Q4 | 2010-10 | 2011Q1 |
3 | 1 | 11/30/2010 | 0.870 | 2010-11-30 | 2010 | 11 | 30 | 2010-11-30 | 2010 | 2010-11 | 2010Q4 | 2010-10 | 2011Q1 |
4 | 2 | 11/30/2010 | 0.200 | 2010-11-30 | 2010 | 11 | 30 | 2010-11-30 | 2010 | 2010-11 | 2010Q4 | 2010-10 | 2011Q1 |
5 | 3 | 12/31/2010 | 0.340 | 2010-12-31 | 2010 | 12 | 31 | 2010-12-31 | 2010 | 2010-12 | 2010Q4 | 2010-11 | 2011Q1 |
6 | 1 | 12/31/2010 | 0.060 | 2010-12-31 | 2010 | 12 | 31 | 2010-12-31 | 2010 | 2010-12 | 2010Q4 | 2010-11 | 2011Q1 |
7 | 2 | 12/31/2010 | 0.001 | 2010-12-31 | 2010 | 12 | 31 | 2010-12-31 | 2010 | 2010-12 | 2010Q4 | 2010-11 | 2011Q1 |
8 | 3 | 12/31/2010 | -0.120 | 2010-12-31 | 2010 | 12 | 31 | 2010-12-31 | 2010 | 2010-12 | 2010Q4 | 2010-11 | 2011Q1 |
Sorting
We can sort a dataframe based on the values in a particular column using the .sort_values()
function. To sort based on the values in an index, we use the .sort_index()
function.
.sort_values()
Syntax:
=0, ascending=True, inplace=False) DataFrame.sort_values(by, axis
'firmid') df.sort_values(
firmid | date | return | |
---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 |
3 | 1 | 11/30/2010 | 0.870 |
6 | 1 | 12/31/2010 | 0.060 |
1 | 2 | 10/31/2010 | 0.450 |
4 | 2 | 11/30/2010 | 0.200 |
7 | 2 | 12/31/2010 | 0.001 |
2 | 3 | 11/30/2010 | 23.000 |
5 | 3 | 12/31/2010 | 0.340 |
8 | 3 | 12/31/2010 | -0.120 |
'firmid', ascending = False) df.sort_values(
firmid | date | return | |
---|---|---|---|
2 | 3 | 11/30/2010 | 23.000 |
5 | 3 | 12/31/2010 | 0.340 |
8 | 3 | 12/31/2010 | -0.120 |
1 | 2 | 10/31/2010 | 0.450 |
4 | 2 | 11/30/2010 | 0.200 |
7 | 2 | 12/31/2010 | 0.001 |
0 | 1 | 09/30/2010 | 0.050 |
3 | 1 | 11/30/2010 | 0.870 |
6 | 1 | 12/31/2010 | 0.060 |
Remember, functions that have an inplace
parameter do not actually change the original dataset unless we set that parameter to True
:
df
firmid | date | return | |
---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 |
1 | 2 | 10/31/2010 | 0.450 |
2 | 3 | 11/30/2010 | 23.000 |
3 | 1 | 11/30/2010 | 0.870 |
4 | 2 | 11/30/2010 | 0.200 |
5 | 3 | 12/31/2010 | 0.340 |
6 | 1 | 12/31/2010 | 0.060 |
7 | 2 | 12/31/2010 | 0.001 |
8 | 3 | 12/31/2010 | -0.120 |
'firmid', inplace = True)
df.sort_values( df
firmid | date | return | |
---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 |
3 | 1 | 11/30/2010 | 0.870 |
6 | 1 | 12/31/2010 | 0.060 |
1 | 2 | 10/31/2010 | 0.450 |
4 | 2 | 11/30/2010 | 0.200 |
7 | 2 | 12/31/2010 | 0.001 |
2 | 3 | 11/30/2010 | 23.000 |
5 | 3 | 12/31/2010 | 0.340 |
8 | 3 | 12/31/2010 | -0.120 |
.sort_index()
Abbreviated syntax:
=0, level=None, ascending=True, inplace=False) DataFrame.sort_index(axis
df.sort_index()
firmid | date | return | |
---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 |
1 | 2 | 10/31/2010 | 0.450 |
2 | 3 | 11/30/2010 | 23.000 |
3 | 1 | 11/30/2010 | 0.870 |
4 | 2 | 11/30/2010 | 0.200 |
5 | 3 | 12/31/2010 | 0.340 |
6 | 1 | 12/31/2010 | 0.060 |
7 | 2 | 12/31/2010 | 0.001 |
8 | 3 | 12/31/2010 | -0.120 |
Lagging and leading
In the context of most financial datasets which contain time-indexed information, lagging a particular variable (column) means obtaining the values for that particular variable from a prior point in time. Leading a variable means obtaining values from a future point in time.
If you research how to lead and lag variables in with pandas dataframes, most sources (including the official Pandas user guide) claims that you can do this using the .shift()
function. In this section I first show you how to use this method, and when it runs into problems. Then show you how you can create leads and lags in a more robust way (using period
dates and the merge
function).
Lagging and leading using .shift()
Syntax:
=1, freq=None, axis=0, fill_value=NoDefault.no_default) DataFrame.shift(periods
Suppose we want to create a new column lag_return
which tells us, for each firm, its returns from the prior month. The general advice you’ll see is that you should first sort your dataframe by firm identifier and by date:
= df.sort_values(['firmid','date']).copy()
df2 df2
firmid | date | return | |
---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 |
3 | 1 | 11/30/2010 | 0.870 |
6 | 1 | 12/31/2010 | 0.060 |
1 | 2 | 10/31/2010 | 0.450 |
4 | 2 | 11/30/2010 | 0.200 |
7 | 2 | 12/31/2010 | 0.001 |
2 | 3 | 11/30/2010 | 23.000 |
5 | 3 | 12/31/2010 | 0.340 |
8 | 3 | 12/31/2010 | -0.120 |
And then use the .shift()
function, after you tell Python that your dates are grouped at the firm level (each firm identifier has its own set of dates):
'lag_ret'] = df2.groupby('firmid')['return'].shift(1)
df2[ df2
firmid | date | return | lag_ret | |
---|---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 | NaN |
3 | 1 | 11/30/2010 | 0.870 | 0.05 |
6 | 1 | 12/31/2010 | 0.060 | 0.87 |
1 | 2 | 10/31/2010 | 0.450 | NaN |
4 | 2 | 11/30/2010 | 0.200 | 0.45 |
7 | 2 | 12/31/2010 | 0.001 | 0.20 |
2 | 3 | 11/30/2010 | 23.000 | NaN |
5 | 3 | 12/31/2010 | 0.340 | 23.00 |
8 | 3 | 12/31/2010 | -0.120 | 0.34 |
Note that the entries for lag_ret
on the second row and the last row are not correct: they do not tell us what the return of the firm was in the prior month. This happens because
- Our data has gaps in coverage (October 2010 is missing for firmid==1)
- Our data has duplicates (there are two entries for December 2010 for firmid==3)
Note that both of these issues disappear if we first get rid of duplicates and if we reinterpret “lagging” to mean “the last available data point” not “data from the last calendar period” and “leading” to mean “the next available data point” and not “data from the following calendar period”. To keep things simple, this is the approach we will take in this course.
However, if this reinterpretation of lagging and leading is not exactly what you need for your application and you need to lag and lead in terms of calendar periods, you should follow the approach below:
Leading and lagging with .merge()
First, again, create a copy of the original dataset:
= df.sort_values(['firmid','date']).copy()
df2 df2
firmid | date | return | |
---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 |
3 | 1 | 11/30/2010 | 0.870 |
6 | 1 | 12/31/2010 | 0.060 |
1 | 2 | 10/31/2010 | 0.450 |
4 | 2 | 11/30/2010 | 0.200 |
7 | 2 | 12/31/2010 | 0.001 |
2 | 3 | 11/30/2010 | 23.000 |
5 | 3 | 12/31/2010 | 0.340 |
8 | 3 | 12/31/2010 | -0.120 |
We will first create a new date variable that tells Python the frequency of our dates:
'mdate'] = pd.to_datetime(df2['date']).dt.to_period('M')
df2[ df2
firmid | date | return | mdate | |
---|---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 | 2010-09 |
3 | 1 | 11/30/2010 | 0.870 | 2010-11 |
6 | 1 | 12/31/2010 | 0.060 | 2010-12 |
1 | 2 | 10/31/2010 | 0.450 | 2010-10 |
4 | 2 | 11/30/2010 | 0.200 | 2010-11 |
7 | 2 | 12/31/2010 | 0.001 | 2010-12 |
2 | 3 | 11/30/2010 | 23.000 | 2010-11 |
5 | 3 | 12/31/2010 | 0.340 | 2010-12 |
8 | 3 | 12/31/2010 | -0.120 | 2010-12 |
Now create a new dataframe containing the firm identifiers (firmid
), the period date (mdate
) and the variable we want to lag (return
):
= df2[['firmid','mdate','return']].copy()
lags lags
firmid | mdate | return | |
---|---|---|---|
0 | 1 | 2010-09 | 0.050 |
3 | 1 | 2010-11 | 0.870 |
6 | 1 | 2010-12 | 0.060 |
1 | 2 | 2010-10 | 0.450 |
4 | 2 | 2010-11 | 0.200 |
7 | 2 | 2010-12 | 0.001 |
2 | 3 | 2010-11 | 23.000 |
5 | 3 | 2010-12 | 0.340 |
8 | 3 | 2010-12 | -0.120 |
Now add (subtract in case of leads) the number of periods you want to lag the return variable (1 in our example) to the period date:
'mdate'] = lags['mdate'] + 1
lags[ lags
firmid | mdate | return | |
---|---|---|---|
0 | 1 | 2010-10 | 0.050 |
3 | 1 | 2010-12 | 0.870 |
6 | 1 | 2011-01 | 0.060 |
1 | 2 | 2010-11 | 0.450 |
4 | 2 | 2010-12 | 0.200 |
7 | 2 | 2011-01 | 0.001 |
2 | 3 | 2010-12 | 23.000 |
5 | 3 | 2011-01 | 0.340 |
8 | 3 | 2011-01 | -0.120 |
And rename return
to lag_return
:
= lags.rename(columns={'return':'lag_return'})
lags lags
firmid | mdate | lag_return | |
---|---|---|---|
0 | 1 | 2010-10 | 0.050 |
3 | 1 | 2010-12 | 0.870 |
6 | 1 | 2011-01 | 0.060 |
1 | 2 | 2010-11 | 0.450 |
4 | 2 | 2010-12 | 0.200 |
7 | 2 | 2011-01 | 0.001 |
2 | 3 | 2010-12 | 23.000 |
5 | 3 | 2011-01 | 0.340 |
8 | 3 | 2011-01 | -0.120 |
Finally, merge this lagged data into the original dataset:
= df2.merge(lags, how='left', on=['firmid','mdate']) df2
'firmid','mdate']) df2.sort_values([
firmid | date | return | mdate | lag_return | |
---|---|---|---|---|---|
0 | 1 | 09/30/2010 | 0.050 | 2010-09 | NaN |
1 | 1 | 11/30/2010 | 0.870 | 2010-11 | NaN |
2 | 1 | 12/31/2010 | 0.060 | 2010-12 | 0.87 |
3 | 2 | 10/31/2010 | 0.450 | 2010-10 | NaN |
4 | 2 | 11/30/2010 | 0.200 | 2010-11 | 0.45 |
5 | 2 | 12/31/2010 | 0.001 | 2010-12 | 0.20 |
6 | 3 | 11/30/2010 | 23.000 | 2010-11 | NaN |
7 | 3 | 12/31/2010 | 0.340 | 2010-12 | 23.00 |
8 | 3 | 12/31/2010 | -0.120 | 2010-12 | 23.00 |
Note that now the second and last entry in lag_return
are correct.
You still have to contend with what it means that you have duplicate entries for return
for December 2010 for firmid==3
but dealing with duplicates needs to be addressed on a case by case basis, depending on the particulars of the data you are using.